Assignment 01

Author

Carissa Feliciano

1. Prepare the data

Read in the data

library(data.table)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':

    between, first, last
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Ca2002 <- data.table::fread(file.path("~/Downloads/PM2.5_Ca_2002.csv"))
Ca2022 <- data.table::fread(file.path("~/Downloads/PM2.5_Ca_2022.csv"))

Checking the PM2.5 California 2002 Dataset

Check the size of the data.

dim(Ca2002)
[1] 15976    22

Look at the top and bottom of the data.

head(Ca2002)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 01/05/2002    AQS 60010007     1                           25.1 ug/m3 LC
2: 01/06/2002    AQS 60010007     1                           31.6 ug/m3 LC
3: 01/08/2002    AQS 60010007     1                           21.4 ug/m3 LC
4: 01/11/2002    AQS 60010007     1                           25.9 ug/m3 LC
5: 01/14/2002    AQS 60010007     1                           34.5 ug/m3 LC
6: 01/17/2002    AQS 60010007     1                           41.0 ug/m3 LC
   Daily AQI Value Local Site Name Daily Obs Count Percent Complete
             <int>          <char>           <int>            <num>
1:              81       Livermore               1              100
2:              93       Livermore               1              100
3:              74       Livermore               1              100
4:              82       Livermore               1              100
5:              98       Livermore               1              100
6:             115       Livermore               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         120
2:              88101  PM2.5 - Local Conditions         120
3:              88101  PM2.5 - Local Conditions         120
4:              88101  PM2.5 - Local Conditions         120
5:              88101  PM2.5 - Local Conditions         120
6:              88101  PM2.5 - Local Conditions         120
                      Method Description CBSA Code
                                  <char>     <int>
1: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                           CBSA Name State FIPS Code      State
                              <char>           <int>     <char>
1: San Francisco-Oakland-Hayward, CA               6 California
2: San Francisco-Oakland-Hayward, CA               6 California
3: San Francisco-Oakland-Hayward, CA               6 California
4: San Francisco-Oakland-Hayward, CA               6 California
5: San Francisco-Oakland-Hayward, CA               6 California
6: San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Site Latitude Site Longitude
              <int>  <char>         <num>          <num>
1:                1 Alameda      37.68753      -121.7842
2:                1 Alameda      37.68753      -121.7842
3:                1 Alameda      37.68753      -121.7842
4:                1 Alameda      37.68753      -121.7842
5:                1 Alameda      37.68753      -121.7842
6:                1 Alameda      37.68753      -121.7842
tail(Ca2002)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 12/10/2002    AQS 61131003     1                             15 ug/m3 LC
2: 12/13/2002    AQS 61131003     1                             15 ug/m3 LC
3: 12/22/2002    AQS 61131003     1                              1 ug/m3 LC
4: 12/25/2002    AQS 61131003     1                             23 ug/m3 LC
5: 12/28/2002    AQS 61131003     1                              5 ug/m3 LC
6: 12/31/2002    AQS 61131003     1                              6 ug/m3 LC
   Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
             <int>               <char>           <int>            <num>
1:              62 Woodland-Gibson Road               1              100
2:              62 Woodland-Gibson Road               1              100
3:               6 Woodland-Gibson Road               1              100
4:              77 Woodland-Gibson Road               1              100
5:              28 Woodland-Gibson Road               1              100
6:              33 Woodland-Gibson Road               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         117
2:              88101  PM2.5 - Local Conditions         117
3:              88101  PM2.5 - Local Conditions         117
4:              88101  PM2.5 - Local Conditions         117
5:              88101  PM2.5 - Local Conditions         117
6:              88101  PM2.5 - Local Conditions         117
                      Method Description CBSA Code
                                  <char>     <int>
1: R & P Model 2000 PM2.5 Sampler w/WINS     40900
2: R & P Model 2000 PM2.5 Sampler w/WINS     40900
3: R & P Model 2000 PM2.5 Sampler w/WINS     40900
4: R & P Model 2000 PM2.5 Sampler w/WINS     40900
5: R & P Model 2000 PM2.5 Sampler w/WINS     40900
6: R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Site Latitude Site Longitude
              <int> <char>         <num>          <num>
1:              113   Yolo      38.66121      -121.7327
2:              113   Yolo      38.66121      -121.7327
3:              113   Yolo      38.66121      -121.7327
4:              113   Yolo      38.66121      -121.7327
5:              113   Yolo      38.66121      -121.7327
6:              113   Yolo      38.66121      -121.7327

Examine the variable names and variable types.

str(Ca2002)
Classes 'data.table' and 'data.frame':  15976 obs. of  22 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily Mean PM2.5 Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  81 93 74 82 98 115 89 62 69 107 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
 $ Method Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(Ca2002$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    7.00   12.00   16.12   20.50  104.30 
sum(is.na(Ca2002$`Daily Mean PM2.5 Concentration`))
[1] 0
sum(Ca2002$`Daily Mean PM2.5 Concentration` == "")
[1] 0
hist(Ca2002$`Daily Mean PM2.5 Concentration`)

Summary of the PM2.5 California 2002 Dataset

In the 2002 dataset of daily average PM2.5 concentrations at all sites in California, there are 15,976 rows (observations) and 22 columns (variables). There are no missing values for the daily mean PM2.5 concentration, and no observations are coded as NA, "", 999, or 9999. The daily mean PM2.5 concentration ranges from 0 to 104.30 ug/m^3, which is plausible. There appear to be no major issues with the data.
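The check for sentinel codes described above could be made explicit along the following lines. This is a minimal sketch on a hypothetical toy vector, not the actual data; the list of sentinel codes (999, 9999) is taken from the summary above, and the variable names are assumptions for illustration.

```r
# Hypothetical toy vector standing in for Ca2002$`Daily Mean PM2.5 Concentration`
pm <- c(25.1, NA, 999, 12.0, -1.5)

# Codes sometimes used in place of a missing value (list taken from the summary above)
sentinels <- c(999, 9999)

n_missing  <- sum(is.na(pm))             # true NA values
n_sentinel <- sum(pm %in% sentinels)     # values equal to a sentinel code
n_negative <- sum(pm < 0, na.rm = TRUE)  # physically impossible values

print(c(missing = n_missing, sentinel = n_sentinel, negative = n_negative))
```

Running the same three counts on the real column would give a direct numeric confirmation of the statements in the summary.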

Checking the PM2.5 California 2022 Dataset

Check the size of the data.

dim(Ca2022)
[1] 59756    22

Look at the top and bottom of the data.

head(Ca2022)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 01/01/2022    AQS 60010007     3                           12.7 ug/m3 LC
2: 01/02/2022    AQS 60010007     3                           13.9 ug/m3 LC
3: 01/03/2022    AQS 60010007     3                            7.1 ug/m3 LC
4: 01/04/2022    AQS 60010007     3                            3.7 ug/m3 LC
5: 01/05/2022    AQS 60010007     3                            4.2 ug/m3 LC
6: 01/06/2022    AQS 60010007     3                            3.8 ug/m3 LC
   Daily AQI Value Local Site Name Daily Obs Count Percent Complete
             <int>          <char>           <int>            <num>
1:              58       Livermore               1              100
2:              60       Livermore               1              100
3:              39       Livermore               1              100
4:              21       Livermore               1              100
5:              23       Livermore               1              100
6:              21       Livermore               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         170
2:              88101  PM2.5 - Local Conditions         170
3:              88101  PM2.5 - Local Conditions         170
4:              88101  PM2.5 - Local Conditions         170
5:              88101  PM2.5 - Local Conditions         170
6:              88101  PM2.5 - Local Conditions         170
                     Method Description CBSA Code
                                 <char>     <int>
1: Met One BAM-1020 Mass Monitor w/VSCC     41860
2: Met One BAM-1020 Mass Monitor w/VSCC     41860
3: Met One BAM-1020 Mass Monitor w/VSCC     41860
4: Met One BAM-1020 Mass Monitor w/VSCC     41860
5: Met One BAM-1020 Mass Monitor w/VSCC     41860
6: Met One BAM-1020 Mass Monitor w/VSCC     41860
                           CBSA Name State FIPS Code      State
                              <char>           <int>     <char>
1: San Francisco-Oakland-Hayward, CA               6 California
2: San Francisco-Oakland-Hayward, CA               6 California
3: San Francisco-Oakland-Hayward, CA               6 California
4: San Francisco-Oakland-Hayward, CA               6 California
5: San Francisco-Oakland-Hayward, CA               6 California
6: San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Site Latitude Site Longitude
              <int>  <char>         <num>          <num>
1:                1 Alameda      37.68753      -121.7842
2:                1 Alameda      37.68753      -121.7842
3:                1 Alameda      37.68753      -121.7842
4:                1 Alameda      37.68753      -121.7842
5:                1 Alameda      37.68753      -121.7842
6:                1 Alameda      37.68753      -121.7842
tail(Ca2022)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 12/01/2022    AQS 61131003     1                            3.4 ug/m3 LC
2: 12/07/2022    AQS 61131003     1                            3.8 ug/m3 LC
3: 12/13/2022    AQS 61131003     1                            6.0 ug/m3 LC
4: 12/19/2022    AQS 61131003     1                           34.8 ug/m3 LC
5: 12/25/2022    AQS 61131003     1                           23.2 ug/m3 LC
6: 12/31/2022    AQS 61131003     1                            1.0 ug/m3 LC
   Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
             <int>               <char>           <int>            <num>
1:              19 Woodland-Gibson Road               1              100
2:              21 Woodland-Gibson Road               1              100
3:              33 Woodland-Gibson Road               1              100
4:              99 Woodland-Gibson Road               1              100
5:              77 Woodland-Gibson Road               1              100
6:               6 Woodland-Gibson Road               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         145
2:              88101  PM2.5 - Local Conditions         145
3:              88101  PM2.5 - Local Conditions         145
4:              88101  PM2.5 - Local Conditions         145
5:              88101  PM2.5 - Local Conditions         145
6:              88101  PM2.5 - Local Conditions         145
                                      Method Description CBSA Code
                                                  <char>     <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Site Latitude Site Longitude
              <int> <char>         <num>          <num>
1:              113   Yolo      38.66121      -121.7327
2:              113   Yolo      38.66121      -121.7327
3:              113   Yolo      38.66121      -121.7327
4:              113   Yolo      38.66121      -121.7327
5:              113   Yolo      38.66121      -121.7327
6:              113   Yolo      38.66121      -121.7327

Look at the variables.

str(Ca2022)
Classes 'data.table' and 'data.frame':  59756 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily Mean PM2.5 Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(Ca2022$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -6.700   4.100   6.800   8.429  10.700 302.500 
sum(is.na(Ca2022$`Daily Mean PM2.5 Concentration`))
[1] 0
sum(Ca2022$`Daily Mean PM2.5 Concentration` == "")
[1] 0
hist(Ca2022$`Daily Mean PM2.5 Concentration`)

Summary of the PM2.5 California 2022 Dataset

In the 2022 dataset of daily average PM2.5 concentrations at all sites in California, there are 59,756 rows (observations) and 22 columns (variables). There are no missing values for the daily mean PM2.5 concentration, and no observations are coded as NA, "", 999, or 9999.

For daily mean PM2.5 concentrations, the range is -6.7 to 302.5 ug/m^3. Physically, the minimum concentration should be 0, since it is not possible to have a negative amount of particles in the air. However, according to the EPA, valid negative values should be included when reporting to databases (https://www.epa.gov/sites/default/files/2016-10/documents/pm2.5_continuous_monitoring.pdf). The AQS generally allows negative values down to -10 ug/m^3. Therefore, I will leave the negative values in this dataset. The maximum is within the range of plausible values.
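One way to apply the -10 ug/m^3 lower limit mentioned above is to sort each value into one of three groups. This is a sketch on hypothetical values; the labels and the name `lower_limit` are my own assumptions, not AQS terminology.

```r
# Hypothetical toy values (ug/m^3); not taken from the dataset
pm <- c(-12.0, -6.7, -0.5, 0, 3.2, 302.5)

# Lower limit quoted in the text above
lower_limit <- -10

# Three-way classification: outside the limit, negative but retained, non-negative
flag <- ifelse(pm < lower_limit, "below limit",
        ifelse(pm < 0, "negative, retained", "non-negative"))

print(table(flag))
```

With the real 2022 column, the minimum of -6.7 means no value would fall in the "below limit" group.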

2. Combine the two years of data into one data frame, create date variable, and change the variable names.

Combine the two years of data into one data frame.

combined_ca <- rbind(Ca2002, Ca2022)

Use the Date variable to create a new column for year.

combined_ca$Date <- as.Date(combined_ca$Date, format = "%m/%d/%Y")
combined_ca$Year <- format(combined_ca$Date, "%Y")
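As a small illustration of the two steps above, the sketch below parses two hypothetical date strings in the same month/day/year layout and extracts the year. Note that `format()` returns a character vector, so `Year` behaves as a discrete label rather than a number.

```r
# Hypothetical date strings in the same layout as the raw Date column
raw_dates <- c("01/05/2002", "12/31/2022")

# Parse month/day/four-digit-year strings into Date objects
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the year as a character string
year <- format(parsed, "%Y")

print(parsed)
print(class(year))

# A string that does not match the format parses to NA, so this is a useful check
stopifnot(!anyNA(parsed))
```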

Change the names of the key variables so they are easier to refer to.

library(dplyr)
combined_ca <- combined_ca |>
    rename(PM2.5 = `Daily Mean PM2.5 Concentration`,
    lat = `Site Latitude`,
    lon = `Site Longitude`)

3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.

pm_stations <- (unique(combined_ca[,c("lat","lon", "Year", "Local Site Name")]))
table(pm_stations$Year)

2002 2022 
 103  167 
library(leaflet)
library(leaflet.extras)

year.pal <- colorFactor(c("red", "blue"), domain = pm_stations$Year)

leaflet(pm_stations) |>
  addTiles() |>
  addCircles(
    lat = ~lat, lng = ~lon,
    color = ~year.pal(Year),
    label = ~paste("", `Year`, "", `Local Site Name`),
    opacity = 0.5, fillOpacity = 0.3, radius = 400) |>
  addLegend(
    "bottomleft",
    pal = year.pal,
    values = ~Year,
    title = "Year",
    opacity = 1
    )

There are larger clusters of monitoring sites near Los Angeles, San Francisco, and Sacramento. There appear to be more monitoring sites on the western side of California than on the eastern side. The majority of the monitoring sites in the 2002 dataset were also listed in the 2022 dataset. The additional sites that appear only in the 2022 dataset are scattered throughout California.
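The overlap between the two years could also be counted directly rather than read off the map. The sketch below uses hypothetical site ID vectors; with the real data, the inputs would be the unique `Site ID` values for each year.

```r
# Hypothetical site IDs for each year (illustrative only)
sites_2002 <- c(60010007, 60070002, 61131003)
sites_2022 <- c(60010007, 61131003, 60571001, 60932001)

shared    <- intersect(sites_2002, sites_2022)  # sites present in both years
only_2022 <- setdiff(sites_2022, sites_2002)    # sites added after 2002

# Share of 2002 sites that are still present in 2022
share_retained <- length(shared) / length(sites_2002)

print(length(shared))
print(length(only_2022))
print(share_retained)
```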

4. Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.

Checking for any missing or implausible values in the combined dataset.

summary(combined_ca$PM2.5)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -6.70    4.50    7.60   10.05   12.20  302.50 
sum(combined_ca$PM2.5 == "")
[1] 0
sum(is.na(combined_ca$PM2.5))
[1] 0
combined_ca <- combined_ca[order(combined_ca$PM2.5), ]
head(combined_ca)
         Date Source  Site ID   POC PM2.5    Units Daily AQI Value
       <Date> <char>    <int> <int> <num>   <char>           <int>
1: 2022-09-20    AQS 60571001     5  -6.7 ug/m3 LC               0
2: 2022-09-19    AQS 60571001     5  -6.3 ug/m3 LC               0
3: 2022-09-21    AQS 60571001     5  -5.1 ug/m3 LC               0
4: 2022-09-03    AQS 60571001     5  -4.7 ug/m3 LC               0
5: 2022-09-22    AQS 60571001     5  -4.7 ug/m3 LC               0
6: 2022-09-04    AQS 60571001     5  -4.1 ug/m3 LC               0
        Local Site Name Daily Obs Count Percent Complete AQS Parameter Code
                 <char>           <int>            <num>              <int>
1: Truckee-Fire Station               1              100              88502
2: Truckee-Fire Station               1              100              88502
3: Truckee-Fire Station               1              100              88502
4: Truckee-Fire Station               1              100              88502
5: Truckee-Fire Station               1              100              88502
6: Truckee-Fire Station               1              100              88502
                AQS Parameter Description Method Code       Method Description
                                   <char>       <int>                   <char>
1: Acceptable PM2.5 AQI & Speciation Mass         733 Met-One BAM W/PM2.5 VSCC
2: Acceptable PM2.5 AQI & Speciation Mass         733 Met-One BAM W/PM2.5 VSCC
3: Acceptable PM2.5 AQI & Speciation Mass         733 Met-One BAM W/PM2.5 VSCC
4: Acceptable PM2.5 AQI & Speciation Mass         733 Met-One BAM W/PM2.5 VSCC
5: Acceptable PM2.5 AQI & Speciation Mass         733 Met-One BAM W/PM2.5 VSCC
6: Acceptable PM2.5 AQI & Speciation Mass         733 Met-One BAM W/PM2.5 VSCC
   CBSA Code                CBSA Name State FIPS Code      State
       <int>                   <char>           <int>     <char>
1:     46020 Truckee-Grass Valley, CA               6 California
2:     46020 Truckee-Grass Valley, CA               6 California
3:     46020 Truckee-Grass Valley, CA               6 California
4:     46020 Truckee-Grass Valley, CA               6 California
5:     46020 Truckee-Grass Valley, CA               6 California
6:     46020 Truckee-Grass Valley, CA               6 California
   County FIPS Code County      lat       lon   Year
              <int> <char>    <num>     <num> <char>
1:               57 Nevada 39.32783 -120.1846   2022
2:               57 Nevada 39.32783 -120.1846   2022
3:               57 Nevada 39.32783 -120.1846   2022
4:               57 Nevada 39.32783 -120.1846   2022
5:               57 Nevada 39.32783 -120.1846   2022
6:               57 Nevada 39.32783 -120.1846   2022
tail(combined_ca)
         Date Source  Site ID   POC PM2.5    Units Daily AQI Value
       <Date> <char>    <int> <int> <num>   <char>           <int>
1: 2022-09-10    AQS 60570005     3 218.2 ug/m3 LC             293
2: 2022-09-10    AQS 60610004     3 243.9 ug/m3 LC             338
3: 2022-08-15    AQS 61050002     1 244.7 ug/m3 LC             339
4: 2022-08-14    AQS 61050002     1 246.2 ug/m3 LC             342
5: 2022-09-16    AQS 60611004     3 296.3 ug/m3 LC             442
6: 2022-07-31    AQS 60932001     3 302.5 ug/m3 LC             454
                Local Site Name Daily Obs Count Percent Complete
                         <char>           <int>            <num>
1: Grass Valley-Litton Building               1              100
2:             Colfax-City Hall               1              100
3:       Weaverville-Courthouse               1              100
4:       Weaverville-Courthouse               1              100
5:     Tahoe City-Fairway Drive               1              100
6:                        Yreka               1              100
   AQS Parameter Code              AQS Parameter Description Method Code
                <int>                                 <char>       <int>
1:              88101               PM2.5 - Local Conditions         209
2:              88502 Acceptable PM2.5 AQI & Speciation Mass         731
3:              88502 Acceptable PM2.5 AQI & Speciation Mass         731
4:              88502 Acceptable PM2.5 AQI & Speciation Mass         731
5:              88502 Acceptable PM2.5 AQI & Speciation Mass         731
6:              88101               PM2.5 - Local Conditions         170
                                   Method Description CBSA Code
                                               <char>     <int>
1: Met One BAM-1022 Mass Monitor w/ VSCC or TE-PM2.5C     46020
2:                       Met-One BAM-1020 W/PM2.5 SCC     40900
3:                       Met-One BAM-1020 W/PM2.5 SCC        NA
4:                       Met-One BAM-1020 W/PM2.5 SCC        NA
5:                       Met-One BAM-1020 W/PM2.5 SCC     40900
6:               Met One BAM-1020 Mass Monitor w/VSCC        NA
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1:                Truckee-Grass Valley, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3:                                                       6 California
4:                                                       6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6:                                                       6 California
   County FIPS Code   County      lat       lon   Year
              <int>   <char>    <num>     <num> <char>
1:               57   Nevada 39.23348 -121.0556   2022
2:               61   Placer 39.10017 -120.9538   2022
3:              105  Trinity 40.73475 -122.9412   2022
4:              105  Trinity 40.73475 -122.9412   2022
5:               61   Placer 39.16602 -120.1488   2022
6:               93 Siskiyou 41.72689 -122.6336   2022
hist(combined_ca$PM2.5)

There are no missing values of PM2.5 in the combined dataset.

The range of daily average PM2.5 concentrations is -6.70 to 302.50 ug/m^3. As mentioned above, the minimum concentration should physically be 0, since it is not possible to have a negative amount of particles in the air. However, according to the EPA, valid negative values should be included when reporting to databases. The AQS generally allows negative values down to -10 ug/m^3.

The maximum PM2.5 value is 302.5 ug/m^3, which was recorded on 07/31/2022 in Yreka, CA. This value seems plausible, as a large wildfire, the McKinney Fire, was burning near Yreka on that date.

Explore the proportions of missing values and implausible values, and provide a summary of any temporal patterns you see in these observations.

In this case, I am assuming that negative values are implausible.

mean(is.na(combined_ca$PM2.5))
[1] 0

The proportion of PM2.5 concentration values that are missing is 0%.

mean(combined_ca$PM2.5 <0, na.rm = TRUE)
[1] 0.002838958

The proportion of PM2.5 concentration values less than 0 is 0.28%. This is a very low percentage, and I am not certain these values are implausible. Therefore, I will leave them in the dataset.

library(ggplot2)
combined_ca[combined_ca$Year == 2002, ] |>
  ggplot()+
  geom_point(mapping = aes(x = Date, y = PM2.5))+
  labs(x = "Date", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations in California, 2002")

combined_ca[combined_ca$Year == 2022, ] |>
  ggplot()+
  geom_point(mapping = aes(x = Date, y = PM2.5))+
  labs(x = "Date", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations in California, 2022")

The scatterplots above show that daily average PM2.5 concentration values were recorded for every day of the year in both 2002 and 2022.
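Date coverage can also be checked without a plot by comparing the observed dates against a complete daily sequence. This is a sketch on a hypothetical five-day window; for the real data the sequence would run from January 1 to December 31 of each year.

```r
# Hypothetical observed dates with two gaps
dates <- as.Date(c("2002-01-01", "2002-01-02", "2002-01-04"))

# Every calendar day in the window of interest
all_days <- seq(as.Date("2002-01-01"), as.Date("2002-01-05"), by = "day")

# Days in the window with no observation
missing_days <- all_days[!all_days %in% dates]

print(missing_days)
```

An empty result for both years would confirm that at least one site reported on every day.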

combined_ca |> filter(PM2.5 < 0) |>
  ggplot()+
  geom_point(mapping = aes(x = Date, y = PM2.5))+
  labs(x = "Date", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Negative Daily Average PM2.5 Concentrations in California, 2002 and 2022")

This scatterplot shows when negative daily average PM2.5 concentrations were recorded during 2002 and 2022. No negative PM2.5 concentrations were recorded in 2002. In 2022, negative PM2.5 values were recorded throughout the year, but the most negative concentrations were recorded between September and October 2022.
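The temporal pattern in the negative values could be summarized numerically by counting them per month and finding the most negative value in each month. The data frame below is a hypothetical stand-in for the filtered rows where PM2.5 is below 0.

```r
# Hypothetical negative observations (illustrative dates and values)
neg <- data.frame(
  Date  = as.Date(c("2022-03-02", "2022-09-19", "2022-09-20", "2022-10-01")),
  PM2.5 = c(-0.4, -6.3, -6.7, -2.1)
)

# Month as a two-digit string, e.g. "09"
neg$Month <- format(neg$Date, "%m")

counts       <- table(neg$Month)                    # number of negative values per month
min_by_month <- tapply(neg$PM2.5, neg$Month, min)   # most negative value per month

print(counts)
print(min_by_month)
```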

5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Write up explanations of what you observe in these data.

State: California

combined_ca_avg <- combined_ca |>
  group_by(Date) |>
  summarize(
    PM2.5_avg = mean(PM2.5, na.rm = TRUE),
    Year = unique(Year)
  )
nrow(combined_ca_avg)
[1] 730

The daily average PM2.5 concentrations for all sites in California were averaged to generate a daily average PM2.5 concentration for California for each day of the year. This dataset was then used to generate the following graphs.
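To make the averaging step concrete, here is a minimal sketch on a hypothetical three-row table, using base R `aggregate()` in place of the `group_by()` and `summarize()` calls above. Each site-day row counts equally, so the statewide value for a date is the plain mean over all rows reported on that date.

```r
# Hypothetical site-level observations: two sites on the first day, one on the second
toy <- data.frame(
  Date  = as.Date(c("2002-01-05", "2002-01-05", "2002-01-06")),
  PM2.5 = c(20, 30, 10)
)

# One row per date, holding the mean over all rows for that date
daily <- aggregate(PM2.5 ~ Date, data = toy, FUN = mean)

print(daily)
```

The 730 rows reported above are consistent with one row per day for two 365-day years.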

ggplot(combined_ca_avg)+
  geom_boxplot(mapping = aes(x = Year, y = PM2.5_avg, fill = Year))+
  labs(x = "Year", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations for California, 2002 vs 2022")

Overall, the daily average PM2.5 concentrations of California were lower in 2022 compared to 2002. The median daily average PM2.5 concentration of California was approximately 16 ug/m^3 in 2002 and 8 ug/m^3 in 2022. The maximum mean daily PM2.5 concentration, excluding outliers, was approximately 36 ug/m^3 in 2002 and 15 ug/m^3 in 2022. The highest outlier was approximately 50 ug/m^3 in 2002 and 20 ug/m^3 in 2022.

ggplot(combined_ca_avg)+
  geom_histogram(mapping = aes(x = PM2.5_avg, fill = Year), color = "dimgrey", binwidth = 2, position = "identity", alpha = 0.6)+
  labs(x = "Daily Average PM2.5 Concentration (ug/m^3)", y = "Number of Days", title = "Daily Average PM2.5 Concentrations for California, 2002 vs 2022")

Based on this histogram, it appears that the daily average PM2.5 concentrations for California have decreased from 2002 to 2022. The distribution of daily average PM2.5 concentrations for California in 2002 was right-skewed with a peak at 14 ug/m^3. The distribution of daily average PM2.5 concentrations for California in 2022 was slightly right-skewed with a peak at 8 ug/m^3. The range of daily average PM2.5 concentrations was approximately 4-51 ug/m^3 in 2002 and 3-19 ug/m^3 in 2022.
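The peak of a histogram can also be located numerically rather than by eye. The sketch below bins a made-up vector `x` into 2-unit bins with base R's `hist()` and reports the fullest bin. The bin edges chosen here (`breaks`) are an assumption and may not line up with the edges ggplot2 picks for `binwidth = 2`.

```r
# Made-up daily averages; not the real data.
x <- c(5, 7, 7.5, 8, 8.5, 9, 11, 13, 19)

# 2-unit bins starting at 0 (assumed edges, for illustration only).
breaks <- seq(0, 20, by = 2)

# Count values per bin without drawing anything.
h <- hist(x, breaks = breaks, plot = FALSE)

# Index of the bin with the most values, and that bin's edges.
peak_bin <- which.max(h$counts)
print(c(lower = h$breaks[peak_bin], upper = h$breaks[peak_bin + 1]))
# → lower 6, upper 8
```

Running this separately for each year would give a reproducible modal bin to quote alongside the plot.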

ggplot(data = combined_ca_avg |>
         mutate(Date = as.Date(format(Date, "2000-%m-%d"))))+
  geom_line(mapping = aes(x = Date, y = PM2.5_avg, color = Year))+
  scale_x_date(date_breaks = "1 month", date_labels = "%b")+
  labs(x = "Month", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations for California, 2002 vs 2022")

The daily average PM2.5 concentrations for California were generally lower in 2022 compared to 2002 for all months of the year. In 2002, the daily average PM2.5 concentrations ranged from approximately 5 ug/m^3 to 50 ug/m^3. In 2022, the daily average PM2.5 concentrations ranged from approximately 3 ug/m^3 to 19 ug/m^3. In 2002, the daily average PM2.5 concentrations for California were highest in November-December. In 2022, the daily average PM2.5 concentrations for California were highest in September.
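The line plot overlays the two years on a single January to December axis by rewriting every date to the same placeholder year before plotting. A small sketch of that step on made-up dates (`dates` and `common` are hypothetical names):

```r
# Three made-up dates from the two study years.
dates <- as.Date(c("2002-03-15", "2022-03-15", "2022-11-30"))

# Keep month and day, replace the year with 2000.
common <- as.Date(format(dates, "2000-%m-%d"))

print(common)
# → "2000-03-15" "2000-03-15" "2000-11-30"
```

After this step, the same calendar day from 2002 and 2022 shares one x position, and the `Year` column is what separates the two lines. The year 2000 is only a placeholder; because it is a leap year, every month and day combination remains a valid date.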

Summary statistics of PM2.5 concentration, by year, across all sites in California

combined_ca |>
  summarize(
    Count = n(),
    Mean = mean(PM2.5, na.rm = TRUE),
    Median = median(PM2.5, na.rm = TRUE),
    Min = min(PM2.5, na.rm = TRUE),
    Max = max(PM2.5, na.rm = TRUE),
    SD = sd(PM2.5, na.rm = TRUE),
    .by = c(Year)
  )
  Year Count      Mean Median  Min   Max        SD
1 2022 59756  8.428595    6.8 -6.7 302.5  7.644274
2 2002 15976 16.115943   12.0  0.0 104.3 13.867372

Unlike the plots above, which use the statewide daily averages, these statistics were generated from the site-level dataset containing the daily average PM2.5 concentrations for all sites in California in 2002 and 2022. It appears that the daily concentrations of PM2.5 have decreased in California from 2002 to 2022. The median daily average PM2.5 concentration across all sites in California was 12 ug/m^3 in 2002 and 6.8 ug/m^3 in 2022, and the mean fell from 16.1 ug/m^3 to 8.4 ug/m^3. The maximum daily average PM2.5 concentration was 104.3 ug/m^3 in 2002 and 302.5 ug/m^3 in 2022. While the maximum was greater in 2022, the lower mean, median, and standard deviation (7.6 vs 13.9 ug/m^3) indicate that most daily values were lower in 2022 than in 2002.
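Since a single extreme day can raise the maximum without changing the rest of the distribution, comparing quantiles by year is one way to check this interpretation. The sketch below uses made-up values (`toy` is a hypothetical stand-in for `combined_ca`) in which the later year has the higher maximum but lower quartiles.

```r
# Made-up site-level values; the 2022 group contains one extreme reading.
toy <- data.frame(
  Year  = rep(c("2002", "2022"), each = 6),
  PM2.5 = c(5, 10, 12, 15, 30, 40,
            2, 5, 7, 8, 10, 300)
)

# Quartiles by year: rows are quantiles, columns are years.
q <- sapply(split(toy$PM2.5, toy$Year), quantile, probs = c(0.25, 0.5, 0.75))

# Maximum by year, for comparison.
mx <- tapply(toy$PM2.5, toy$Year, max)

print(q)
print(mx)
```

If the real data behave like this toy case, every quartile would be lower in 2022 even though the maximum is higher.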

County: Los Angeles County

combined_LAC <- combined_ca[combined_ca$County=="Los Angeles", ]

combined_LAC_avg <- combined_LAC |>
  group_by(Date) |>
  summarize(
    PM2.5_avg = mean(PM2.5, na.rm = TRUE),
    Year = unique(Year)
  )

The daily average PM2.5 concentrations for all sites in Los Angeles County (LAC) were averaged to generate a daily average PM2.5 concentration for LAC for each day of the year. This dataset was then used to generate the following graphs.

ggplot(combined_LAC_avg)+
  geom_boxplot(mapping = aes(x = Year, y = PM2.5_avg, fill = Year))+
  labs(x = "Year", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations for Los Angeles County, 2002 vs 2022")

Generally, the daily average PM2.5 concentrations for Los Angeles County (LAC) were lower in 2022 compared to 2002. The median daily average PM2.5 concentration for LAC was approximately 18 ug/m^3 in 2002 and 11 ug/m^3 in 2022. The maximum daily average PM2.5 concentration, excluding outliers, for LAC was approximately 43 ug/m^3 in 2002 and 20 ug/m^3 in 2022. The interquartile range was narrower in 2022 compared to 2002.

ggplot(combined_LAC_avg)+
  geom_histogram(mapping = aes(x = PM2.5_avg, fill = Year), color = "dimgrey", binwidth = 2, position = "identity", alpha = 0.6)+
  labs(x = "Daily Average PM2.5 Concentration (ug/m^3)", y = "Number of Days", title = "Daily Average PM2.5 Concentrations for Los Angeles County, 2002 vs 2022")

The daily average PM2.5 concentrations for Los Angeles County (LAC) were generally lower in 2022 compared to 2002. The distribution of daily average PM2.5 concentrations for LAC in 2002 was right-skewed with a peak at approximately 16 ug/m^3 and a second peak at 23 ug/m^3. The distribution of daily average PM2.5 concentrations for LAC in 2022 was slightly right-skewed with a peak at 11 ug/m^3.

ggplot(data = combined_LAC_avg |>
         mutate(Date = as.Date(format(Date, "2000-%m-%d"))))+
  geom_line(mapping = aes(x = Date, y = PM2.5_avg, color = Year))+
  scale_x_date(date_breaks = "1 month", date_labels = "%b")+
  labs(x = "Month", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations for Los Angeles County, 2002 vs 2022")

In general, the daily average PM2.5 concentrations of Los Angeles County were lower in 2022 compared to 2002 for all months of the year. The difference in PM2.5 concentrations was greatest for the months of October and December when comparing 2002 to 2022. The range of daily average PM2.5 concentrations was approximately 5-58 ug/m^3 in 2002 and 3-26 ug/m^3 in 2022.

combined_LAC |>
  summarize(
    Count = n(),
    Mean = mean(PM2.5, na.rm = TRUE),
    Median = median(PM2.5, na.rm = TRUE),
    Min = min(PM2.5, na.rm = TRUE),
    Max = max(PM2.5, na.rm = TRUE),
    SD = sd(PM2.5, na.rm = TRUE),
    .by = c(Year)
  )
  Year Count     Mean Median  Min  Max        SD
1 2022  5070 10.97164   10.3 -1.2 56.0  5.238462
2 2002  1879 19.65604   17.4  0.6 72.4 11.884042

These statistics were generated from the site-level dataset containing the daily average PM2.5 concentrations for all sites in Los Angeles County in 2002 and 2022. The daily average PM2.5 concentrations in Los Angeles County were lower in 2022 compared to 2002. The median daily average PM2.5 concentration across all sites in LAC was 17.4 ug/m^3 in 2002 and 10.3 ug/m^3 in 2022. The maximum daily average PM2.5 concentration was 72.4 ug/m^3 in 2002 and 56 ug/m^3 in 2022.

Site in Los Angeles: Pasadena

combined_pas <- combined_ca[combined_ca$`Local Site Name`=="Pasadena", ]
ggplot(combined_pas)+
  geom_boxplot(mapping = aes(x = Year, y = PM2.5, fill = Year))+
  labs(x = "Year", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations for Pasadena, CA, 2002 vs 2022")

Overall, the daily average PM2.5 concentrations for the Pasadena site were lower in 2022 compared to 2002. The median daily average PM2.5 concentration was approximately 18 ug/m^3 in 2002 and 8 ug/m^3 in 2022. The maximum daily average PM2.5 concentration, excluding outliers, was approximately 45 ug/m^3 in 2002 and 19 ug/m^3 in 2022.

ggplot(combined_pas)+
  geom_histogram(mapping = aes(x = PM2.5, fill = Year), color = "dimgrey", binwidth = 2, position = "identity", alpha = 0.6)+
  labs(x = "Daily Average PM2.5 Concentration (ug/m^3)", y = "Number of Days", title = "Daily Average PM2.5 Concentrations for Pasadena, CA, 2002 vs 2022")

Overall, the daily average PM2.5 concentrations at the Pasadena site were lower in 2022 compared to 2002. The distribution of the daily average PM2.5 concentrations for 2002 is right-skewed with a long right tail and peak at approximately 12 ug/m^3. The distribution of the daily average PM2.5 concentrations for 2022 is slightly right-skewed with a peak at 6 ug/m^3. Based on this graph, the range was approximately 3-59 ug/m^3 in 2002 and 3-23 ug/m^3 in 2022.

ggplot(data = combined_pas |>
         mutate(Date = as.Date(format(Date, "2000-%m-%d"))))+
  geom_line(mapping = aes(x = Date, y = PM2.5, color = Year))+
  scale_x_date(date_breaks = "1 month", date_labels = "%b")+
  labs(x = "Month", y = "Daily Average PM2.5 Concentration (ug/m^3)", title = "Daily Average PM2.5 Concentrations for Pasadena, CA, 2002 vs 2022")

Generally, the daily average PM2.5 concentrations were lower in 2022 compared to 2002 for all months of the year. Based on this graph, the range of daily average PM2.5 concentration values was approximately 4-58 ug/m^3 in 2002 and 4-22 ug/m^3 in 2022. The differences in PM2.5 concentrations were smallest during the months of May and June.
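The month-by-month comparison can also be made numerically by computing a monthly mean for each year and taking the difference. The sketch below does this on made-up values; `toy`, `monthly`, and `gap` are hypothetical names, and the data are not the Pasadena readings.

```r
# Made-up readings for two months in each year; not the real data.
toy <- data.frame(
  Date  = as.Date(c("2002-05-01", "2002-05-04", "2002-12-01", "2002-12-04",
                    "2022-05-02", "2022-05-05", "2022-12-02", "2022-12-05")),
  PM2.5 = c(12, 14, 40, 30, 10, 12, 9, 11)
)
toy$Year  <- format(toy$Date, "%Y")
toy$Month <- format(toy$Date, "%m")

# Rows = month, columns = year, cells = mean PM2.5.
monthly <- tapply(toy$PM2.5, list(toy$Month, toy$Year), mean)

# Gap between years for each month (2002 minus 2022).
gap <- monthly[, "2002"] - monthly[, "2022"]

print(monthly)
print(gap)
```

With roughly 120 readings per year at this site, each monthly mean rests on about ten values, so small differences between months should be read cautiously.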

combined_pas |>
  summarize(
    Count = n(),
    Mean = mean(PM2.5, na.rm = TRUE),
    Median = median(PM2.5, na.rm = TRUE),
    Min = min(PM2.5, na.rm = TRUE),
    Max = max(PM2.5, na.rm = TRUE),
    SD = sd(PM2.5, na.rm = TRUE),
    .by = c(Year)
  )
  Year Count      Mean Median Min  Max        SD
1 2022   120  9.094167    7.9 3.5 22.1  3.679726
2 2002   121 20.290909   17.8 4.0 57.8 11.143085

The daily average PM2.5 concentrations at the Pasadena, CA site were lower in 2022 compared to 2002. The median daily average PM2.5 concentration was 17.8 ug/m^3 in 2002 and 7.9 ug/m^3 in 2022. The maximum daily average PM2.5 concentration was 57.8 ug/m^3 in 2002 and 22.1 ug/m^3 in 2022.